Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754218AbYCIL0C (ORCPT ); Sun, 9 Mar 2008 07:26:02 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S1752765AbYCILZw (ORCPT ); Sun, 9 Mar 2008 07:25:52 -0400
Received: from hobbit.corpit.ru ([81.13.94.6]:24396 "EHLO hobbit.corpit.ru"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752406AbYCILZv (ORCPT ); Sun, 9 Mar 2008 07:25:51 -0400
Message-ID: <47D3C93D.8070204@msgid.tls.msk.ru>
Date: Sun, 09 Mar 2008 14:25:49 +0300
From: Michael Tokarev
Organization: Telecom Service, JSC
User-Agent: Mozilla-Thunderbird 2.0.0.9 (X11/20080110)
MIME-Version: 1.0
To: Linux-kernel , SCSI Mailing List
Subject: Re: kernel BUG at drivers/scsi/aic7xxx/aic79xx_osm.c:1490!
References: <47D3C8A1.6040409@msgid.tls.msk.ru>
In-Reply-To: <47D3C8A1.6040409@msgid.tls.msk.ru>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 4177
Lines: 96

Michael Tokarev wrote:
> Just got quite.. bad situation on a production server
> here. The machine locked up hard several times in a
> row (required hard reboot). So I finally enabled watchdog
> subsystem which helped.
>
> Now I see the following (over netconsole):

Forgot the most important information.

# uname -a
Linux tbus90.msk.rgs-podm.ru 2.6.24-x86-64 #2.6.24.2 SMP Mon Feb 18 16:04:41 MSK 2008 x86_64 GNU/Linux

It's mostly vanilla 2.6.24.2, with some irrelevant patches
like unionfs (not even loaded).

> DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:08:07.0
> ------------[ cut here ]------------
> kernel BUG at drivers/scsi/aic7xxx/aic79xx_osm.c:1490!
> invalid opcode: 0000 [1] SMP
> CPU 0
> Modules linked in: xfs netconsole nfsd lockd nfs_acl sunrpc exportfs
> autofs4 iTCO_wdt iTCO_vendor_support raid10 raid0 sr_mod cdrom ata_piix
> libata tg3 mptspi mptscsih mptbase ext3 jbd mbcache raid1 md_mod sd_mod
> aic79xx scsi_transport_spi scsi_mod
> Pid: 2176, comm: gzip Not tainted 2.6.24-x86-64 #2.6.24.2
> RIP: 0010:[] []
> :aic79xx:ahd_linux_queue+0x58a/0x590
> RSP: 0000:ffffffff80511d40 EFLAGS: 00010082
> RAX: 00000000fffffff4 RBX: ffff81018c331600 RCX: 00000000fffffff4
> RDX: ffff8100063660e0 RSI: 0000000000000002 RDI: ffffffff804a2150
> RBP: ffff8101a9029e40 R08: 0000000000000044 R09: 0000000000000000
> R10: 00000000fffffff4 R11: ffffffff80222d80 R12: ffff8101aff8d418
> R13: ffff8101aeea7000 R14: ffff8101aef50000 R15: ffff8101aeea78b4
> FS: 0000000000000000(0000) GS:ffffffff804b7000(0063)
> knlGS:00000000f7de56b0
> CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
> CR2: 0000000008065000 CR3: 00000001adbb8000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process gzip (pid: 2176, threadinfo ffff8101a9270000, task
> ffff8101a91b2000)
> Stack: ffff8101aff8d000 0000000000000083 0000000000000220 ffffffff80245435
> ffff81014ec656c0 0000000000000293 ffff8101aff8d000 ffff81018c331600
> ffff8101aef48800 ffff81018c331600 ffff8101aff8d048 ffffffff8800100c
> Call Trace:
> [] __mod_timer+0xb5/0xd0
> [] :scsi_mod:scsi_dispatch_cmd+0x17c/0x2e0
> [] :scsi_mod:scsi_request_fn+0x225/0x3d0
> [] blk_run_queue+0x43/0x80
> [] :scsi_mod:scsi_next_command+0x3b/0x60
> [] :scsi_mod:scsi_end_request+0xd5/0x110
> [] :scsi_mod:scsi_io_completion+0xae/0x3e0
> [] blk_done_softirq+0x69/0x80
> [] __do_softirq+0x75/0xe0
> [] call_softirq+0x1c/0x30
> [] do_softirq+0x35/0x90
> [] irq_exit+0x88/0x90
> [] do_IRQ+0x80/0x100
> [] ret_from_intr+0x0/0xa
>
>
> Code: 0f 0b eb fe 66 90 48 83 ec 78 4c 89 64 24 58 4c 89 74 24 68
> RIP []
> :aic79xx:ahd_linux_queue+0x58a/0x590
> RSP
> Kernel panic - not syncing: Fatal exception
>
>
> The hardware is an IBM xSeries 346 [8840ECY] machine, with
> 2x dualcore CPUs and 6Gb Ram. It has 2 SCSI controllers -
> one onboard 2-channel AIC-7902B, and one LSI Logic 53c1030 PCI-X
> Fusion-MPT Dual Ultra320. Total 16 drives are attached to the
> 2 controllers.
>
> There's a linux software raid10 array running over 14 drives
> (7 drives on each controller), and an XFS filesystem on top of
> it (410Gb).
>
> The problem (the above oops) happens almost immediately after
> I'm trying to gzip some file on that filesystem - the system
> dies within one minute of running gzip. The same happens when
> I try to copy those files over NFS - the same instant lockup,
> but happens later than with gzip.
>
> Please help!.... This is a critical piece of hardware.
>
> Thanks!
>
> /mjt
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/