2009-10-09 17:42:42

by Thomas Fjellstrom

Subject: MVSAS 1669:mvs_abort_task:rc= 5

Hi,

I've been trying to get an AOC-SASLP-MV8 card (PCIe x4 2 port SAS card) to
work with Linux for the past month or so. I recently RMAed my first card and
tested the replacement under Linux, and I see the same problems.

The very first time I made a new array off the controller, formatted it (with
xfs) and mounted the volume, it seemed to work. iozone even seemed to run for a
while. Sadly, after a few minutes I got a stream of mvs_abort_task messages in
dmesg, and any access to the volume, or to any disk connected to the
controller, locks up.

After that I updated my 2.6.31 kernel to 2.6.32-rc3-git2 off of kernel.org,
and the volume fails to mount with the same mvs_abort_task messages.

I've attached my dmesg log from both attempts. dmesg1 is the first attempt,
dmesg2 is the second attempt on the newer kernel.

I would really appreciate some help with this.

--
Thomas Fjellstrom
[email protected]


Attachments:
test-data.dmesg1 (68.03 kB)
test-data.dmesg2 (67.29 kB)

2009-10-10 15:57:07

by Thomas Fjellstrom

Subject: Re: MVSAS 1669:mvs_abort_task:rc= 5

On Fri October 9 2009, Thomas Fjellstrom wrote:
> Hi,
>
> I've been trying to get an AOC-SASLP-MV8 card (pcie x4 2 port SAS card) to
> work with linux for the past month or so. I've recently just RMAed my first
> card, and tested the new one under linux, and I see the same problems.
>
> The very first time I made a new array off the controller, formatted (with
> xfs) and mounted the volume, it seemed to work. iozone even seemed to run
> for a while. Sadly after a few minutes I got a stream of mvs_abort_task
> messages in dmesg, and any accesses to the volume, or any disks connected
> to the controller lock up.
>
> After that I updated my 2.6.31 kernel to 2.6.32-rc3-git2 off of kernel.org,
> and the volume fails to mount with the same mvs_abort_task messages.
>
> I've attached my dmesg log from both attempts. dmesg1 is the first attempt,
> dmesg2 is the second attempt on the newer kernel.
>
> I would really appreciate some help with this.
>

I should also mention that if I attempt to remove any of the drives while the
card is in a hung state, the entire machine locks up with the keyboard LEDs
blinking.

This does seem to be a rather serious error.

If there is more I can do to help track down the problems, please let me know.
I really need to get this to work (my current array is FULL, and may be
showing signs of failure) as soon as possible.

I can pretty much wipe and reinstall everything at any point on this machine,
update/install new kernels, etc. I just don't have a null modem cable atm to
hook up a serial console to capture the errors on OOPSes.

Thanks.

--
Thomas Fjellstrom
[email protected]

2009-10-10 19:19:08

by Thomas Fjellstrom

Subject: Re: MVSAS 1669:mvs_abort_task:rc= 5

On Sat October 10 2009, Ying Chu wrote:
> Hi, Thomas
>
> Did you build the dm/md storage with SATA disk drives?

Yes I did. I've got 5 Seagate 7200.12 1TB disks, and two 2TB WD Green drives.
The last two tests used solely the 5 Seagates though.



--
Thomas Fjellstrom
[email protected]

2009-10-11 18:45:22

by Christian Vilhelm

Subject: Re: MVSAS 1669:mvs_abort_task:rc= 5

Oct 11 20:15:14 almery kernel: md: bind<sdh>
Oct 11 20:15:14 almery kernel: md: bind<sdi>
Oct 11 20:15:14 almery kernel: md: bind<sdj>
Oct 11 20:15:14 almery kernel: md: bind<sdk>
Oct 11 20:15:14 almery kernel: md: bind<sdl>
Oct 11 20:15:14 almery kernel: md: bind<sdm>
Oct 11 20:15:14 almery kernel: raid5: device sdl operational as raid disk 4
Oct 11 20:15:14 almery kernel: raid5: device sdk operational as raid disk 3
Oct 11 20:15:14 almery kernel: raid5: device sdj operational as raid disk 2
Oct 11 20:15:14 almery kernel: raid5: device sdi operational as raid disk 1
Oct 11 20:15:14 almery kernel: raid5: device sdh operational as raid disk 0
Oct 11 20:15:14 almery kernel: raid5: allocated 6384kB for md1
Oct 11 20:15:14 almery kernel: raid5: raid level 5 set md1 active with 5 out of 6 devices, algorithm 2
Oct 11 20:15:14 almery kernel: RAID5 conf printout:
Oct 11 20:15:14 almery kernel: --- rd:6 wd:5
Oct 11 20:15:14 almery kernel: disk 0, o:1, dev:sdh
Oct 11 20:15:14 almery kernel: disk 1, o:1, dev:sdi
Oct 11 20:15:14 almery kernel: disk 2, o:1, dev:sdj
Oct 11 20:15:14 almery kernel: disk 3, o:1, dev:sdk
Oct 11 20:15:14 almery kernel: disk 4, o:1, dev:sdl
Oct 11 20:15:14 almery kernel: md1: detected capacity change from 0 to 2500536565760
Oct 11 20:15:14 almery kernel: RAID5 conf printout:
Oct 11 20:15:14 almery kernel: --- rd:6 wd:5
Oct 11 20:15:14 almery kernel: disk 0, o:1, dev:sdh
Oct 11 20:15:14 almery kernel: disk 1, o:1, dev:sdi
Oct 11 20:15:14 almery kernel: disk 2, o:1, dev:sdj
Oct 11 20:15:14 almery kernel: disk 3, o:1, dev:sdk
Oct 11 20:15:14 almery kernel: disk 4, o:1, dev:sdl
Oct 11 20:15:14 almery kernel: disk 5, o:1, dev:sdm
Oct 11 20:15:14 almery kernel: md: recovery of RAID array md1
Oct 11 20:15:14 almery kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Oct 11 20:15:14 almery kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Oct 11 20:15:14 almery kernel: md: using 128k window, over a total of 488386048 blocks.
Oct 11 20:15:14 almery ata_id[16774]: HDIO_GET_IDENTITY failed for '/dev/sdi'
Oct 11 20:15:14 almery ata_id[16782]: HDIO_GET_IDENTITY failed for '/dev/sdj'
Oct 11 20:15:14 almery ata_id[16785]: HDIO_GET_IDENTITY failed for '/dev/sdl'
Oct 11 20:15:14 almery ata_id[16786]: HDIO_GET_IDENTITY failed for '/dev/sdk'
Oct 11 20:15:14 almery ata_id[16790]: HDIO_GET_IDENTITY failed for '/dev/sdm'
Oct 11 20:15:44 almery kernel: md1:
Oct 11 20:15:44 almery kernel: sas: command 0xffff880138191600, task 0xffff8801399de380, timed out: BLK_EH_NOT_HANDLED
Oct 11 20:15:44 almery kernel: sas: command 0xffff880138191800, task 0xffff8801399de540, timed out: BLK_EH_NOT_HANDLED
Oct 11 20:15:44 almery kernel: sas: command 0xffff880138191000, task 0xffff8801399de000, timed out: BLK_EH_NOT_HANDLED
Oct 11 20:15:44 almery kernel: sas: command 0xffff880138191100, task 0xffff8801399de1c0, timed out: BLK_EH_NOT_HANDLED
Oct 11 20:15:44 almery kernel: sas: command 0xffff880138191900, task 0xffff8801399de700, timed out: BLK_EH_NOT_HANDLED
Oct 11 20:15:44 almery kernel: sas: command 0xffff88013ac77800, task 0xffff88013ea19500, timed out: BLK_EH_NOT_HANDLED
Oct 11 20:15:44 almery kernel: sas: Enter sas_scsi_recover_host
Oct 11 20:15:44 almery kernel: sas: trying to find task 0xffff8801399de380
Oct 11 20:15:44 almery kernel: sas: sas_scsi_find_task: aborting task 0xffff8801399de380
Oct 11 20:15:44 almery kernel: drivers/scsi/mvsas/mv_sas.c 1669:mvs_abort_task:rc= 5
Oct 11 20:15:44 almery kernel: sas: sas_scsi_find_task: querying task 0xffff8801399de380
Oct 11 20:15:44 almery kernel: drivers/scsi/mvsas/mv_sas.c 1608:mvs_query_task:rc= 5
Oct 11 20:15:44 almery kernel: sas: sas_scsi_find_task: task 0xffff8801399de380 failed to abort
Oct 11 20:15:44 almery kernel: sas: task 0xffff8801399de380 is not at LU: I_T recover
Oct 11 20:15:44 almery kernel: sas: I_T nexus reset for dev 5001b4d5020e2000
Oct 11 20:15:44 almery kernel: sas: I_T 5001b4d5020e2000 recovered
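A log like the one above can be condensed before posting it to the list. What follows is only a rough sketch, not part of the original report: the regular expressions are guessed from the lines shown here and may need adjusting for other kernel versions, and the function name summarize is made up for illustration. It counts distinct timed-out tasks, mvsas return codes and I_T nexus resets.

```python
import re
from collections import Counter

# Patterns derived from the log lines in this thread (an assumption:
# other kernel versions may format these messages differently).
TIMEOUT_RE = re.compile(
    r"sas: command (0x[0-9a-f]+), task (0x[0-9a-f]+), timed out: (\w+)")
MVSAS_RE = re.compile(r"mv_sas\.c (\d+):(mvs_\w+):rc= ?(-?\d+)")
RESET_RE = re.compile(r"sas: I_T nexus reset for dev ([0-9a-f]+)")


def summarize(lines):
    """Count timed-out tasks, mvsas return codes and nexus resets."""
    timed_out = set()      # distinct task pointers that timed out
    mvsas_rc = Counter()   # (function name, rc) -> occurrences
    resets = Counter()     # SAS address -> number of I_T nexus resets
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timed_out.add(m.group(2))
            continue
        m = MVSAS_RE.search(line)
        if m:
            mvsas_rc[(m.group(2), int(m.group(3)))] += 1
            continue
        m = RESET_RE.search(line)
        if m:
            resets[m.group(1)] += 1
    return {
        "timed_out_tasks": len(timed_out),
        "mvsas_rc": dict(mvsas_rc),
        "resets": dict(resets),
    }


# Four lines copied from the syslog extract above.
SAMPLE = """\
Oct 11 20:15:44 almery kernel: sas: command 0xffff880138191600, task 0xffff8801399de380, timed out: BLK_EH_NOT_HANDLED
Oct 11 20:15:44 almery kernel: drivers/scsi/mvsas/mv_sas.c 1669:mvs_abort_task:rc= 5
Oct 11 20:15:44 almery kernel: drivers/scsi/mvsas/mv_sas.c 1608:mvs_query_task:rc= 5
Oct 11 20:15:44 almery kernel: sas: I_T nexus reset for dev 5001b4d5020e2000
""".splitlines()

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To run it on a full capture, pass the open file instead of SAMPLE, for example summarize(open("syslog.out")).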


Attachments:
syslog.out (13.75 kB)

2009-10-11 22:59:01

by Thomas Fjellstrom

Subject: Re: MVSAS 1669:mvs_abort_task:rc= 5

On Sun October 11 2009, Christian Vilhelm wrote:
> Thomas Fjellstrom wrote:
> > Hi,
> >
> > I've been trying to get an AOC-SASLP-MV8 card (pcie x4 2 port SAS card)
> > to work with linux for the past month or so. I've recently just RMAed my
> > first card, and tested the new one under linux, and I see the same
> > problems.
> >
> > The very first time I made a new array off the controller, formatted (with
> > xfs) and mounted the volume, it seemed to work. iozone even seemed to
> > run for a while. Sadly after a few minutes I got a stream of
> > mvs_abort_task messages in dmesg, and any accesses to the volume, or any
> > disks connected to the controller lock up.
> >
> > After that I updated my 2.6.31 kernel to 2.6.32-rc3-git2 off of
> > kernel.org, and the volume fails to mount with the same mvs_abort_task
> > messages.
>
> I have the exact same problem with another Marvell 88SE64xx based card,
> namely an Areca ARC-1300ix-16 and the mvsas driver.
> If the disks are just used alone, with a filesystem on them, all seems
> to work fine. dd and badblocks run fine on them. Mounting them and
> reading/writing work fine. The error seems to pop up, but only rarely,
> when several disks are used simultaneously.
> But an absolutely sure way to trigger the error is to assemble (or
> create) an md raid array with the disks. I attach a syslog extract of the
> error. You can see it happens seconds after the array creation.
> I tried:
> 1) disabling the write cache on the disks => same error
> 2) disabling NCQ, in mv_sas.h:
> #define MV_DISABLE_NCQ 1
> => same error.
> After a while, the devices handled by the card are just dropped from the
> system and the card stops working altogether; a reboot is necessary.

I have found that a proper reboot is impossible once the card/driver starts
misbehaving. Anything that touches the md device, or any of the component
drives, will hang, even kernel threads it seems. A reboot or a shutdown hangs
when it tries to sync the md device, and Alt+SysRq+S and Alt+SysRq+U both
hang. After the first Alt+SysRq+S the kernel still registers further ones, but
it never prints the "Emergency Sync complete" message.
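For machines without a usable keyboard, the same SysRq requests can be issued by writing the command key to /proc/sysrq-trigger as root. The sketch below is only an illustration and makes some assumptions: the helper name sysrq and the action names are made up here, and the "show_blocked" entry (SysRq key "w", which dumps tasks stuck in uninterruptible sleep) is an addition that was not tried in this thread. On a system hung as described above, the sync request may well hang in the same way.

```python
# Path of the kernel's SysRq trigger file (requires root to write).
SYSRQ_TRIGGER = "/proc/sysrq-trigger"

# Small subset of SysRq command keys; the action names are hypothetical
# labels chosen for this sketch, the keys are the standard ones.
SYSRQ_KEYS = {
    "sync": "s",          # emergency sync, same as Alt+SysRq+S
    "remount_ro": "u",    # remount filesystems read-only, Alt+SysRq+U
    "show_blocked": "w",  # dump blocked (uninterruptible) tasks
}


def sysrq(action, trigger_path=SYSRQ_TRIGGER):
    """Write the key for `action` to the trigger file and return the key."""
    key = SYSRQ_KEYS[action]
    with open(trigger_path, "w") as trigger:
        trigger.write(key)
    return key


if __name__ == "__main__":
    # Only list what would be sent; call sysrq("sync") as root to send it.
    for name, key in SYSRQ_KEYS.items():
        print(f"{name}: echo {key} > {SYSRQ_TRIGGER}")
```

The output of the "w" request lands in dmesg, so it is only useful if the log can still be read or captured over a serial or network console.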

> Does anyone have a working config based on a Marvell 64xx card ?
>
> I'm willing to explore solutions, patches or anything, just tell me what
> to do to help.
>
> Christian Vilhelm.
>


--
Thomas Fjellstrom
[email protected]

2009-10-14 01:40:07

by Thomas Fjellstrom

Subject: Re: MVSAS 1669:mvs_abort_task:rc= 5

On Sun October 11 2009, Thomas Fjellstrom wrote:
> On Sun October 11 2009, Christian Vilhelm wrote:
> > Thomas Fjellstrom wrote:
> > > Hi,
> > >
> > > I've been trying to get an AOC-SASLP-MV8 card (pcie x4 2 port SAS card)
> > > to work with linux for the past month or so. I've recently just RMAed
> > > my first card, and tested the new one under linux, and I see the same
> > > problems.
> > >
> > > The very first time I made a new array off the controller, formatted
> > > (with xfs) and mounted the volume, it seemed to work. iozone even
> > > seemed to run for a while. Sadly after a few minutes I got a stream of
> > > mvs_abort_task messages in dmesg, and any accesses to the volume, or
> > > any disks connected to the controller lock up.
> > >
> > > After that I updated my 2.6.31 kernel to 2.6.32-rc3-git2 off of
> > > kernel.org, and the volume fails to mount with the same mvs_abort_task
> > > messages.
> >
> > I have the exact same problem with another Marvell 88SE64xx based card,
> > namely an Areca ARC-1300ix-16 and the mvsas driver.
> > If the disks are just used alone, with a filesystem on them, all seems
> > to work fine. dd and badblocks run fine on them. Mounting them and
> > reading/writing work fine. The error seems to pop up, but only rarely,
> > when several disks are used simultaneously.
> > But an absolutely sure way to trigger the error is to assemble (or
> > create) an md raid array with the disks. I attach a syslog extract of the
> > error. You can see it happens seconds after the array creation.
> > I tried:
> > 1) disabling the write cache on the disks => same error
> > 2) disabling NCQ, in mv_sas.h:
> > #define MV_DISABLE_NCQ 1
> > => same error.
> > After a while, the devices handled by the card are just dropped from the
> > system and the card stops working altogether; a reboot is necessary.
>
> I have found that a proper reboot is impossible once the card/driver starts
> misbehaving. Anything that tries to do anything with the md device, or any
> of the component drives will hang. Even kernel threads it seems. A reboot
> or a shutdown hangs when it tries to sync the md device, and ALT+SYSRQ+S/U
> both hang. After the first Alt+sysrq+s it will register more of them, but
> it won't print the "Emergency Sync Complete" message.
>
> > Does anyone have a working config based on a Marvell 64xx card ?
> >
> > I'm willing to explore solutions, patches or anything, just tell me what
> > to do to help.
> >
> > Christian Vilhelm.
>

I'd really appreciate some assistance with this. With the current driver the
card is essentially useless under Linux, if not harmful (it causes oopses and
hangs).

My last weekly backup failed while creating the disk image because my array
is low on space, so I really need to get the new array up ASAP.

Thanks.

--
Thomas Fjellstrom
[email protected]

2009-10-14 07:19:39

by Thomas Fjellstrom

Subject: Re: MVSAS 1669:mvs_abort_task:rc= 5

On Tue October 13 2009, andy yan wrote:
> I will send you a patch for debugging this issue, please help to try and
> send back the log, thanks!

I will do whatever I can to help get this resolved :) I have some C skills,
but no kernel/device driver experience, so at the very least I should be able
to do builds and make small changes if needed, in addition to patching and
endless reboots ;D



--
Thomas Fjellstrom
[email protected]

2009-10-14 07:55:56

by Christian Vilhelm

Subject: Re: MVSAS 1669:mvs_abort_task:rc= 5

Thomas Fjellstrom wrote:
> On Tue October 13 2009, andy yan wrote:
>> I will send you a patch for debugging this issue, please help to try and
>> send back the log, thanks!
>
> I will do whatever I can to help get this resolved :) I have some C skills,
> but no kernel/device driver experience, so at the very least I should be able
> to do builds and make small changes if needed, in addition to patching and
> endless reboots ;D

I'm also willing to help.
The card is not on a production server and the disks connected to the
card do not contain any valuable data, so I can run any test you want.

When the problem occurs, the devices (disks) seem to be hosed. Deleting
them from the system (echo 1 > /sys/block/sdh/device/delete), removing
the mvsas module (rmmod -f) and reloading it doesn't help. The card
seems to be correctly initialised after reloading the module, and it
responds correctly to commands (in /sys/class/sas_phy/ and sas_ports, I
can reset ports/phys, I can ask for a rescan of disks). But the disks
themselves do not seem to answer the scan and are not detected; all I get is:

Oct 13 15:17:33 almery kernel: [29162.468218] sas: sas_ata_phy_reset: Found ATA device.
Oct 13 15:17:33 almery kernel: [29162.470279] ata19.00: both IDENTIFYs aborted, assuming NODEV
Oct 13 15:17:33 almery kernel: [29162.470321] sas: sas_ata_phy_reset: Found ATA device.
Oct 13 15:17:33 almery kernel: [29162.472391] ata19.00: both IDENTIFYs aborted, assuming NODEV
Oct 13 15:17:33 almery kernel: [29162.472433] sas: sas_ata_phy_reset: Found ATA device.
Oct 13 15:17:33 almery kernel: [29162.474492] ata19.00: both IDENTIFYs aborted, assuming NODEV
Oct 13 15:17:33 almery kernel: [29162.474533] ata19.00: disabled
Oct 13 15:17:33 almery kernel: [29162.474572] sas: sas_ata_phy_reset: Found ATA device.
Oct 13 15:17:33 almery kernel: [29162.474627] scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured
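
The delete-and-rescan sequence described above can be written down as a small script, which makes it easier to repeat between tests. This is only a sketch under assumptions: the SCSI host number (6 below) is hypothetical and has to be looked up under /sys/class/scsi_host/ on the actual machine, the function names are made up for illustration, and the module reload (rmmod/modprobe of mvsas) is left out. As reported above, this sequence did not bring the disks back in this case.

```python
from pathlib import Path


def delete_path(disk, sysfs_root="/sys"):
    """sysfs file that removes a SCSI disk when '1' is written to it."""
    return Path(sysfs_root) / "block" / disk / "device" / "delete"


def scan_path(host, sysfs_root="/sys"):
    """sysfs file that triggers a rescan of a SCSI host."""
    return Path(sysfs_root) / "class" / "scsi_host" / f"host{host}" / "scan"


def plan_rescan(disks, host, sysfs_root="/sys"):
    """Return (path, value) writes: delete each disk, then rescan the host."""
    steps = [(delete_path(disk, sysfs_root), "1") for disk in disks]
    # "- - -" is the wildcard for all channels, targets and LUNs.
    steps.append((scan_path(host, sysfs_root), "- - -"))
    return steps


def run_steps(steps, dry_run=True):
    """Print the equivalent shell commands, or perform the writes as root."""
    for path, value in steps:
        if dry_run:
            print(f"echo '{value}' > {path}")
        else:
            path.write_text(value)


if __name__ == "__main__":
    # Disk names from the log in this thread; host number is an assumption.
    run_steps(plan_rescan(["sdh", "sdi"], host=6))
```

With dry_run left at True the script only prints the commands, so the plan can be checked before anything is written to sysfs.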



Is there a way to get a disk to reinitialize itself without a reboot?

Drives are SAMSUNG HD501LJ.
Kernel: Linux almery 2.6.31.1-vs2.3.0.36.14 #7 SMP Mon Oct 12 12:58:07 CEST 2009 x86_64 GNU/Linux
The problem is the same with or without the vserver patch applied; the
kernel is not tainted.

The problem also occurs when the disks are not in an md array.

Christian Vilhelm.

--
/~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\
| Christian Vilhelm : [email protected] |
| Reality is for people who lack imagination |
\____________________________________________________________________/