2010-11-17 07:53:46

by Thomas Fjellstrom

Subject: Re: mvsas errors in 2.6.36

On October 31, 2010, Thomas Fjellstrom wrote:
> On October 29, 2010, Thomas Fjellstrom wrote:
> > Good news and bad news: the current mvsas driver in 2.6.36 seems to work
> > better than older kernels with my setup (2 port sas + 5 SATA disks). But
> > I've gotten the following messages so far:
> >
[snip]
> > I did not unplug a disk, the errors seem to be spurious.
> >
> > Otherwise though things seem to be working. At least so far. The
> > mv_abort_task part is very familiar, the older version of this driver
> > would do it right after attempting to build/activate the md raid5 array
> > that lives on this controller. Except the controller would lock up, and
> > all drives would become inaccessible.
> >
> > I'm going to attempt to grow this array today, so long as the xfs_fsr
> > that I started doesn't cause the array to fail.
> >
> > If I keep getting mv_abort_task errors, I'll have to back down to the
> > copy of the driver I got from Andy Yan. I've managed to patch it up to
> > compile for 2.6.36 just now, I just hope it'll work at least as well as
> > it did with 2.6.34. At the very least I didn't get these errors.
> >
> > Some background: the disks attached to the card are (5) Seagate 7200.12
> > 1TB disks, using SAS->SATA cables. Machine is an amd64 Phenom II X4 810
> > w/4G ram running debian sid and a vanilla 2.6.36 kernel. The card is an
> > AOC-SASLP-MV8, according to lspci:
> >
> > 04:00.0 SCSI storage controller: Marvell Technology Group Ltd.
> > MV64460/64461/64462 System Controller, Revision B (rev 01)
> >
> > according to dmesg:
> >
[snip]
> > I just hope the raid5 reshape I'm about to do doesn't crap its pants
> > because of the errors above.
> >
> > I'd like to help test any fixes or changes if needed. Let me know.
> >
> > Thanks again.
>
> After a couple days of uptime, the messages are still happening:
>
[snip]
> No fatal errors yet.

Still no fatal errors, but the problem is still happening regularly. It causes
a pause in disk I/O of at least a couple of seconds. Really quite annoying.

One thing that's got me wondering: could this be a power issue? From the
messages, it almost seems like a single drive (any drive) is freaking out and
returning an error that probably shouldn't happen (no CHS 0?), which could
mean the drive is underpowered and the firmware is flipping out. I'm not
entirely sure. The system has a decent quality 750W Antec power supply. The
total power use of the system shouldn't come to more than half that (Phenom II
X4 810 CPU, Gigabyte ma790fxtud5p motherboard, low profile Nvidia 9400GS GPU,
8 SATA HDDs, 3 fans, etc). I'm /mostly/ sure the 12V rails are spread out
evenly, but I have yet to make absolutely sure.

But then the root drives never seem to flip out. There are two 500GB Seagate
7200.12 drives in an md raid1 on the motherboard's (SB750) SATA II controller.
They work fine, with no messages regarding them at all the entire time.
However, I get frequent and repeated messages from all drives on the
mvsas-based controller.

So color me stumped.

--
Thomas Fjellstrom
[email protected]


2010-11-17 08:33:14

by Andre Tomt

Subject: Re: mvsas errors in 2.6.36

On 11/17/2010 08:53 AM, Thomas Fjellstrom wrote:
[snip]
> Still no fatal errors, but the problem is still happening regularly. It causes
> a pause in disk io of a couple seconds at least. Really quite annoying.
[snip]

After the mvsas update in 2.6.35 this started happening to me as well;
at least it's better than the previous state (not working) ;-) However,
after rolling a new 2.6.35 with the following fix that is queued up for
the upcoming 2.6.35 and 2.6.36 stable releases, the errors seem to have
disappeared: 3 days and counting.

http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob_plain;f=queue-2.6.33/libsas-fix-ncq-mixing-with-non-ncq.patch;h=b6d7c92094d95ad67a3b23c2e09c25d4fbd0f46b;hb=HEAD

The fix is queued up for the next 2.6.36 and 2.6.35 stable point-releases.

2010-12-02 06:30:26

by Thomas Fjellstrom

Subject: Re: mvsas errors in 2.6.36

On November 17, 2010, you wrote:
> On 11/17/2010 08:53 AM, Thomas Fjellstrom wrote:
> [snip]
>
> > Still no fatal errors, but the problem is still happening regularly. It
> > causes a pause in disk io of a couple seconds at least. Really quite
> > annoying.
> >
> > One thing thats got me wondering, is could this be a power issue?
> > It almost seems like (from the messages) that a single drive (any drive)
> > is freaking out, and returning an error that probably shouldn't happen (no
> > CHS 0?), which could mean the drive is underpowered and the firmware is
> > flipping out. I'm not entirely sure. The system has a 750w decent quality
> > Antec power supply. The total power use of the system shouldn't come over
> > half that (phenom II x4 810 cpu, gigabyte ma790fxtud5p mb, low profile
> > nvidia 9400GS gpu, 8 sata hdds, 3 fans, etc). I'm mostly sure the 12v
> > rails are spread out evenly, but I have yet to make absolutely sure.

Made absolutely sure. I had been worrying that I was overloading one of the
rails on the PSU, but it turns out that it isn't a multi 12V rail PSU after
all. The box and advertising say it is, but the electronics inside all say
it's a single 12V rail device.

> [snip]
>
> After the mvsas update in 2.6.35 this started happening to me as well;
> at least its better than the previous state - not working.. ;-) However,
> after rolling a new 2.6.35 with the following fix that is queued up for
> the upcoming 2.6.35 and 2.6.36 stable releases, they seem to have
> dissapeared - 3 days and counting.
>
> http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob_plain;f=queue-2.6.33/libsas-fix-ncq-mixing-with-non-ncq.patch;h=b6d7c92094d95ad67a3b23c2e09c25d4fbd0f46b;hb=HEAD
>
> The fix is queued up for the next 2.6.36 and 2.6.35 stable point-releases.

Ahah. I wonder how I missed that when I first read it. I'll have to give the
stable .36 kernel a try. Thanks!


--
Thomas Fjellstrom
[email protected]

2010-12-02 09:49:12

by Thomas Fjellstrom

Subject: Re: mvsas errors in 2.6.36

On December 1, 2010, Thomas Fjellstrom wrote:
> On November 17, 2010, you wrote:
> > On 11/17/2010 08:53 AM, Thomas Fjellstrom wrote:
> > [snip]
> >
> > > Still no fatal errors, but the problem is still happening regularly. It
> > > causes a pause in disk io of a couple seconds at least. Really quite
> > > annoying.
> > >
> > > One thing thats got me wondering, is could this be a power issue?
> > > It almost seems like (from the messages) that a single drive (any
> > > drive) is freaking out, and returning an error that probably shouldn't
> > > happen (no CHS 0?), which could mean the drive is underpowered and the
> > > firmware is flipping out. I'm not entirely sure. The system has a 750w
> > > decent quality Antec power supply. The total power use of the system
> > > shouldn't come over half that (phenom II x4 810 cpu, gigabyte
> > > ma790fxtud5p mb, low profile nvidia 9400GS gpu, 8 sata hdds, 3 fans,
> > > etc). I'm mostly sure the 12v rails are spread out evenly, but I have
> > > yet to make absolutely sure.
>
> Made absolute sure. I had been worrying that I was overloading one of the
> rails on the PSU, but it turns out that it isn't a multi 12v rail PSU after
> all. The box and advertising says it is, but the electronics inside all say
> its a single 12v rail device.
>
> > [snip]
> >
> > After the mvsas update in 2.6.35 this started happening to me as well;
> > at least its better than the previous state - not working.. ;-) However,
> > after rolling a new 2.6.35 with the following fix that is queued up for
> > the upcoming 2.6.35 and 2.6.36 stable releases, they seem to have
> > dissapeared - 3 days and counting.
> >
> > http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob_plain;f=queue-2.6.33/libsas-fix-ncq-mixing-with-non-ncq.patch;h=b6d7c92094d95ad67a3b23c2e09c25d4fbd0f46b;hb=HEAD
> >
> > The fix is queued up for the next 2.6.36 and 2.6.35 stable
> > point-releases.
>
> Ahah. I wonder how I missed that when I first read it. I'll have to give
> the stable .36 kernel a try. Thanks!

No fix so far:

[ 2539.040104] drivers/scsi/mvsas/mv_sas.c 1703:<7>mv_abort_task() mvi=ffff880222f00000 task=ffff88018b3e2980 slot=ffff880222f265d0 slot_idx=x2
[ 2539.040118] drivers/scsi/mvsas/mv_sas.c 1632:mvs_query_task:rc= 5
[ 2539.040154] drivers/scsi/mvsas/mv_sas.c 2083:port 7 ctrl sts=0x89800.
[ 2539.040163] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x1001001
[ 2539.040176] drivers/scsi/mvsas/mv_sas.c 2111:phy7 Unplug Notice
[ 2539.050220] drivers/scsi/mvsas/mv_sas.c 2083:port 7 ctrl sts=0x199800.
[ 2539.050229] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x1001081
[ 2539.071157] drivers/scsi/mvsas/mv_sas.c 2083:port 7 ctrl sts=0x199800.
[ 2539.071165] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x10000
[ 2539.071173] drivers/scsi/mvsas/mv_sas.c 2138:notify plug in on phy[7]
[ 2539.081142] drivers/scsi/mvsas/mv_sas.c 1224:port 7 attach dev info is 5000002
[ 2539.081142] drivers/scsi/mvsas/mv_sas.c 1226:port 7 attach sas addr is 7
[ 2539.081142] drivers/scsi/mvsas/mv_sas.c 378:phy 7 byte dmaded.
[ 2541.270047] drivers/scsi/mvsas/mv_sas.c 1586:mvs_I_T_nexus_reset for device[5]:rc= 0
[ 2541.270066] ata14: translated ATA stat/err 0x01/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[ 2541.270926] ata14: status=0x01 { Error }
[ 2541.271747] ata14: error=0x04 { DriveStatusError }

That appeared after about 42 minutes of uptime.
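
For reference, the last three lines of that log can be read bit by bit. Below is a minimal sketch, assuming the standard ATA status and error register layout; the table and function names are made up for illustration and are not part of libata or mvsas. Status 0x01 is the ERR bit and error 0x04 is the ABRT bit (which libata prints as DriveStatusError); SCSI sense key 0xb is ABORTED COMMAND, so the drive is reporting that it aborted the command rather than a media error.

```python
# Illustrative only: bit names follow the ATA status/error register layout;
# these tables and helpers are hypothetical, not kernel code.
ATA_STATUS_BITS = {
    0x80: "BSY",
    0x40: "DRDY",
    0x20: "DF",
    0x08: "DRQ",
    0x01: "ERR",
}
ATA_ERROR_BITS = {
    0x80: "ICRC",
    0x40: "UNC",
    0x10: "IDNF",
    0x04: "ABRT",
}


def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table.items() if value & mask]


def decode_stat_err(text):
    """Decode a 'stat/err' pair such as '0x01/04', as printed in the log."""
    stat_s, err_s = text.split("/")
    stat, err = int(stat_s, 16), int(err_s, 16)
    return decode(stat, ATA_STATUS_BITS), decode(err, ATA_ERROR_BITS)


print(decode_stat_err("0x01/04"))
```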

--
Thomas Fjellstrom
[email protected]

2010-12-03 16:39:53

by Thomas Fjellstrom

Subject: Re: mvsas errors in 2.6.36

On December 2, 2010, Thomas Fjellstrom wrote:
> On December 1, 2010, Thomas Fjellstrom wrote:
> > On November 17, 2010, you wrote:
> > > On 11/17/2010 08:53 AM, Thomas Fjellstrom wrote:
> > > [snip]
> > >
> > > > Still no fatal errors, but the problem is still happening regularly.
> > > > It causes a pause in disk io of a couple seconds at least. Really
> > > > quite annoying.
> > > >
> > > > One thing thats got me wondering, is could this be a power issue?
> > > > It almost seems like (from the messages) that a single drive (any
> > > > drive) is freaking out, and returning an error that probably
> > > > shouldn't happen (no CHS 0?), which could mean the drive is
> > > > underpowered and the firmware is flipping out. I'm not entirely
> > > > sure. The system has a 750w decent quality Antec power supply. The
> > > > total power use of the system shouldn't come over half that (phenom
> > > > II x4 810 cpu, gigabyte ma790fxtud5p mb, low profile nvidia 9400GS
> > > > gpu, 8 sata hdds, 3 fans, etc). I'm mostly sure the 12v rails are
> > > > spread out evenly, but I have yet to make absolutely sure.
> >
> > Made absolute sure. I had been worrying that I was overloading one of the
> > rails on the PSU, but it turns out that it isn't a multi 12v rail PSU
> > after all. The box and advertising says it is, but the electronics
> > inside all say its a single 12v rail device.
> >
> > > [snip]
> > >
> > > After the mvsas update in 2.6.35 this started happening to me as well;
> > > at least its better than the previous state - not working.. ;-)
> > > However, after rolling a new 2.6.35 with the following fix that is
> > > queued up for the upcoming 2.6.35 and 2.6.36 stable releases, they
> > > seem to have dissapeared - 3 days and counting.
> > >
> > > http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob_plain;f=queue-2.6.33/libsas-fix-ncq-mixing-with-non-ncq.patch;h=b6d7c92094d95ad67a3b23c2e09c25d4fbd0f46b;hb=HEAD
> > >
> > > The fix is queued up for the next 2.6.36 and 2.6.35 stable
> > > point-releases.
> >
> > Ahah. I wonder how I missed that when I first read it. I'll have to give
> > the stable .36 kernel a try. Thanks!
>
> No fix so far:
>
> [ 2539.040104] drivers/scsi/mvsas/mv_sas.c 1703:<7>mv_abort_task() mvi=ffff880222f00000 task=ffff88018b3e2980 slot=ffff880222f265d0 slot_idx=x2
> [ 2539.040118] drivers/scsi/mvsas/mv_sas.c 1632:mvs_query_task:rc= 5
> [ 2539.040154] drivers/scsi/mvsas/mv_sas.c 2083:port 7 ctrl sts=0x89800.
> [ 2539.040163] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x1001001
> [ 2539.040176] drivers/scsi/mvsas/mv_sas.c 2111:phy7 Unplug Notice
> [ 2539.050220] drivers/scsi/mvsas/mv_sas.c 2083:port 7 ctrl sts=0x199800.
> [ 2539.050229] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x1001081
> [ 2539.071157] drivers/scsi/mvsas/mv_sas.c 2083:port 7 ctrl sts=0x199800.
> [ 2539.071165] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x10000
> [ 2539.071173] drivers/scsi/mvsas/mv_sas.c 2138:notify plug in on phy[7]
> [ 2539.081142] drivers/scsi/mvsas/mv_sas.c 1224:port 7 attach dev info is 5000002
> [ 2539.081142] drivers/scsi/mvsas/mv_sas.c 1226:port 7 attach sas addr is 7
> [ 2539.081142] drivers/scsi/mvsas/mv_sas.c 378:phy 7 byte dmaded.
> [ 2541.270047] drivers/scsi/mvsas/mv_sas.c 1586:mvs_I_T_nexus_reset for device[5]:rc= 0
> [ 2541.270066] ata14: translated ATA stat/err 0x01/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> [ 2541.270926] ata14: status=0x01 { Error }
> [ 2541.271747] ata14: error=0x04 { DriveStatusError }
>
> That appeared after about 42 minutes of uptime.

So after about 32 hours of uptime there have been 36 separate events. Each
spits out messages similar to those above, and each comes with a noticeable
pause while the drive is reset.
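
A count like that can be pulled out of saved dmesg output with a few lines of Python. This is only a rough sketch: it assumes each event starts with exactly one mv_abort_task() line (as in the log quoted above) and that the timestamps are seconds since boot. The function names are hypothetical.

```python
import re

# Matches the dmesg timestamp and the mv_abort_task() marker that opens an event.
ABORT_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s.*mv_abort_task\(\)")


def abort_times(dmesg_text):
    """Return boot-relative timestamps (seconds) of every mv_abort_task() line."""
    times = []
    for line in dmesg_text.splitlines():
        m = ABORT_RE.match(line)
        if m:
            times.append(float(m.group(1)))
    return times


def events_per_hour(count, uptime_hours):
    """Average event rate over the given uptime."""
    return count / uptime_hours


sample = """\
[ 2539.040104] drivers/scsi/mvsas/mv_sas.c 1703:<7>mv_abort_task() mvi=ffff880222f00000 task=ffff88018b3e2980 slot=ffff880222f265d0 slot_idx=x2
[ 2539.040118] drivers/scsi/mvsas/mv_sas.c 1632:mvs_query_task:rc= 5
"""
print(abort_times(sample))       # the single event from the earlier log
print(events_per_hour(36, 32))   # the figures quoted above
```

With 36 events in 32 hours that works out to a little over one event per hour.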

There are a number of possible reasons that I'm still having issues:
- I managed to mess up the git checkout
- My problem isn't related to the fix
- The fix doesn't cover all cases of the problem it was meant to fix

I'm not certain which of them it is. I'd be more inclined to think I messed up
the checkout, as I did patch something in, but the patches were completely
unrelated and shouldn't have affected the scsi or ata systems at all. At this
point I'm just grasping at straws.

In case my card is somehow different than expected, I'll paste the lspci info
for it: (AOC-SASLP-MV8)

04:00.0 SCSI storage controller: Marvell Technology Group Ltd. MV64460/64461/64462 System Controller, Revision B (rev 01)
Subsystem: Super Micro Computer Inc Device 0500
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 19
Region 2: I/O ports at df00 [size=128]
Region 4: Memory at fdef0000 (64-bit, non-prefetchable) [size=64K]
[virtual] Expansion ROM at fdd00000 [disabled] [size=256K]
Capabilities: [48] Power Management version 2
Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [e0] Express (v1) Legacy Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 2048 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 <256ns, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Kernel driver in use: mvsas

It's installed in a Phenom II X4 810 based system with a 790FX/SB750 chipset,
8G DDR3 1333 RAM, six 1TB Seagate 7200.12 SATA II drives connected to the
card via SAS->SATA breakout cables, and a couple of 4-drive SATA hotswap bays.
There are also two Seagate 7200.12 500G drives hooked up to the motherboard
SATA controller. The system is powered by an Antec Neopower Blue 650W PSU,
which is probably only half loaded. The system also has a discrete graphics
card, but it's a low end, low profile, fanless card that draws next to no power.

I'm still willing to help test any fixes for the mvsas driver on this card.

Thank you.

--
Thomas Fjellstrom
[email protected]

2010-12-03 20:31:20

by David Milburn

Subject: Re: mvsas errors in 2.6.36

Thomas Fjellstrom wrote:
> On December 2, 2010, Thomas Fjellstrom wrote:
>> On December 1, 2010, Thomas Fjellstrom wrote:
>>> On November 17, 2010, you wrote:
>>>> On 11/17/2010 08:53 AM, Thomas Fjellstrom wrote:
>>>> [snip]
>>>>
>>>>> Still no fatal errors, but the problem is still happening regularly.
>>>>> It causes a pause in disk io of a couple seconds at least. Really
>>>>> quite annoying.
>>>>>
>>>>> One thing thats got me wondering, is could this be a power issue?
>>>>> It almost seems like (from the messages) that a single drive (any
>>>>> drive) is freaking out, and returning an error that probably
>>>>> shouldn't happen (no CHS 0?), which could mean the drive is
>>>>> underpowered and the firmware is flipping out. I'm not entirely
>>>>> sure. The system has a 750w decent quality Antec power supply. The
>>>>> total power use of the system shouldn't come over half that (phenom
>>>>> II x4 810 cpu, gigabyte ma790fxtud5p mb, low profile nvidia 9400GS
>>>>> gpu, 8 sata hdds, 3 fans, etc). I'm mostly sure the 12v rails are
>>>>> spread out evenly, but I have yet to make absolutely sure.
>>> Made absolute sure. I had been worrying that I was overloading one of the
>>> rails on the PSU, but it turns out that it isn't a multi 12v rail PSU
>>> after all. The box and advertising says it is, but the electronics
>>> inside all say its a single 12v rail device.
>>>
>>>> [snip]
>>>>
>>>> After the mvsas update in 2.6.35 this started happening to me as well;
>>>> at least its better than the previous state - not working.. ;-)
>>>> However, after rolling a new 2.6.35 with the following fix that is
>>>> queued up for the upcoming 2.6.35 and 2.6.36 stable releases, they
>>>> seem to have dissapeared - 3 days and counting.
>>>>
>>>> http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob_plain;f=queue-2.6.33/libsas-fix-ncq-mixing-with-non-ncq.patch;h=b6d7c92094d95ad67a3b23c2e09c25d4fbd0f46b;hb=HEAD
>>>>
>>>> The fix is queued up for the next 2.6.36 and 2.6.35 stable
>>>> point-releases.
>>> Ahah. I wonder how I missed that when I first read it. I'll have to give
>>> the stable .36 kernel a try. Thanks!
>> No fix so far:
>>
>> [ 2539.040104] drivers/scsi/mvsas/mv_sas.c 1703:<7>mv_abort_task() mvi=ffff880222f00000 task=ffff88018b3e2980 slot=ffff880222f265d0 slot_idx=x2
>> [ 2539.040118] drivers/scsi/mvsas/mv_sas.c 1632:mvs_query_task:rc= 5
>> [ 2539.040154] drivers/scsi/mvsas/mv_sas.c 2083:port 7 ctrl sts=0x89800.
>> [ 2539.040163] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x1001001
>> [ 2539.040176] drivers/scsi/mvsas/mv_sas.c 2111:phy7 Unplug Notice
>> [ 2539.050220] drivers/scsi/mvsas/mv_sas.c

The controller is reporting a phy ready state change, which is why you see
the unplug notice.

Can you enable SCSI_SAS_LIBSAS_DEBUG and see if libsas reports anything
before the abort?

You should be able to turn it on in your kernel config:

Device Drivers
  SCSI device support
    SCSI Transports
      Compile the SAS Domain Transport Attributes in debug mode
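
To double-check that the option actually made it into the kernel being booted, the saved config can be inspected. A small sketch, assuming the usual .config text format ("CONFIG_FOO=y" or "# CONFIG_FOO is not set"); the helper name is hypothetical, and where the config file lives (for example the .config in the build tree) depends on the setup.

```python
def option_state(config_text, symbol):
    """Return the value ('y', 'm', ...) of a Kconfig symbol, or None if unset."""
    name = "CONFIG_" + symbol
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(name + "="):
            return line.split("=", 1)[1]
        if line == "# " + name + " is not set":
            return None
    return None


# Made-up config fragment for illustration: libsas built, debug still off.
sample = "CONFIG_SCSI_SAS_LIBSAS=m\n# CONFIG_SCSI_SAS_LIBSAS_DEBUG is not set\n"
print(option_state(sample, "SCSI_SAS_LIBSAS_DEBUG"))
```

The libsas debug messages should only show up once this reports "y" for the running kernel.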

Thanks,
David

>> 2083:port 7 ctrl sts=0x199800.
>> [ 2539.050229] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x1001081
>> [ 2539.071157] drivers/scsi/mvsas/mv_sas.c 2083:port 7 ctrl sts=0x199800.
>> [ 2539.071165] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x10000
>> [ 2539.071173] drivers/scsi/mvsas/mv_sas.c 2138:notify plug in on phy[7]
>> [ 2539.081142] drivers/scsi/mvsas/mv_sas.c 1224:port 7 attach dev info is 5000002
>> [ 2539.081142] drivers/scsi/mvsas/mv_sas.c 1226:port 7 attach sas addr is 7
>> [ 2539.081142] drivers/scsi/mvsas/mv_sas.c 378:phy 7 byte dmaded.
>> [ 2541.270047] drivers/scsi/mvsas/mv_sas.c 1586:mvs_I_T_nexus_reset for device[5]:rc= 0
>> [ 2541.270066] ata14: translated ATA stat/err 0x01/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>> [ 2541.270926] ata14: status=0x01 { Error }
>> [ 2541.271747] ata14: error=0x04 { DriveStatusError }
>>
>> That appeared after about 42 minutes of uptime.
>
> So after about 32 hours of uptime theres been 36 separate events. Each spits
> out similar messages as above, and each comes with a noticeable pause while
> the drive is reset.
>
> There are a number of possible reasons that I'm still having issues:
> - I managed to mess up the git checkout
> - My problem isn't related to the fix
> - The fix doesn't cover all cases of the problem it meant to fix
>
> I'm not certain which of them it is, I'd be more inclined to think I messed up
> the checkout, as I did patch something in, but the patches were completely
> unrelated and shouldn't have affected the scsi or ata systems at all. At this
> point I'm just grasping at straws.
>
> In case my card is somehow different than expected, I'll paste the lspci info
> for it: (AOC-SASLP-MV8)
>
> 04:00.0 SCSI storage controller: Marvell Technology Group Ltd. MV64460/64461/64462 System Controller, Revision B (rev 01)
> Subsystem: Super Micro Computer Inc Device 0500
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 19
> Region 2: I/O ports at df00 [size=128]
> Region 4: Memory at fdef0000 (64-bit, non-prefetchable) [size=64K]
> [virtual] Expansion ROM at fdd00000 [disabled] [size=256K]
> Capabilities: [48] Power Management version 2
> Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> Address: 0000000000000000 Data: 0000
> Capabilities: [e0] Express (v1) Legacy Endpoint, MSI 00
> DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> MaxPayload 128 bytes, MaxReadReq 2048 bytes
> DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
> LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 <256ns, L1 unlimited
> ClockPM- Surprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> Capabilities: [100 v1] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
> Kernel driver in use: mvsas
>
> Its installed in a Phenom II X4 810 based system with a 790FX/SB750 chipset,
> 8G DDR3 1333 RAM, 6 1TB Seagate 7200.12 SATAII drives connected to the
> card via sas->sata breakout cables, and a couple 4 drive SATA hotswap bays.
> There are also two Seagate 7200.12 500G drives hooked up to the motherboard
> SATA controller. The system is powered via an Antec Neopower Blue 650W PSU
> which is probably only half loaded. System also has a discreet gfx card, but its
> a low end, low profile, fanless card that takes up next to no power.
>
> I'm still willing to help test any fixes for the mvsas driver on this card.
>
> Thank you.
>

2010-12-04 06:57:41

by Thomas Fjellstrom

Subject: Re: mvsas errors in 2.6.36

On December 3, 2010, David Milburn wrote:
> Thomas Fjellstrom wrote:
> > On December 2, 2010, Thomas Fjellstrom wrote:
> >> On December 1, 2010, Thomas Fjellstrom wrote:
> >>> On November 17, 2010, you wrote:
> >>>> On 11/17/2010 08:53 AM, Thomas Fjellstrom wrote:
> >>>> [snip]
> >>>>
> >>>>> Still no fatal errors, but the problem is still happening regularly.
> >>>>> It causes a pause in disk io of a couple seconds at least. Really
> >>>>> quite annoying.
> >>>>>
> >>>>> One thing thats got me wondering, is could this be a power issue?
> >>>>> It almost seems like (from the messages) that a single drive (any
> >>>>> drive) is freaking out, and returning an error that probably
> >>>>> shouldn't happen (no CHS 0?), which could mean the drive is
> >>>>> underpowered and the firmware is flipping out. I'm not entirely
> >>>>> sure. The system has a 750w decent quality Antec power supply. The
> >>>>> total power use of the system shouldn't come over half that (phenom
> >>>>> II x4 810 cpu, gigabyte ma790fxtud5p mb, low profile nvidia 9400GS
> >>>>> gpu, 8 sata hdds, 3 fans, etc). I'm mostly sure the 12v rails are
> >>>>> spread out evenly, but I have yet to make absolutely sure.
> >>>
> >>> Made absolute sure. I had been worrying that I was overloading one of
> >>> the rails on the PSU, but it turns out that it isn't a multi 12v rail
> >>> PSU after all. The box and advertising says it is, but the electronics
> >>> inside all say its a single 12v rail device.
> >>>
> >>>> [snip]
> >>>>
> >>>> After the mvsas update in 2.6.35 this started happening to me as well;
> >>>> at least its better than the previous state - not working.. ;-)
> >>>> However, after rolling a new 2.6.35 with the following fix that is
> >>>> queued up for the upcoming 2.6.35 and 2.6.36 stable releases, they
> >>>> seem to have dissapeared - 3 days and counting.
> >>>>
> >>>> http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob_plain;f=queue-2.6.33/libsas-fix-ncq-mixing-with-non-ncq.patch;h=b6d7c92094d95ad67a3b23c2e09c25d4fbd0f46b;hb=HEAD
> >>>>
> >>>> The fix is queued up for the next 2.6.36 and 2.6.35 stable
> >>>> point-releases.
> >>>
> >>> Ahah. I wonder how I missed that when I first read it. I'll have to
> >>> give the stable .36 kernel a try. Thanks!
> >>
> >> No fix so far:
> >>
> >> [ 2539.040104] drivers/scsi/mvsas/mv_sas.c 1703:<7>mv_abort_task() mvi=ffff880222f00000 task=ffff88018b3e2980 slot=ffff880222f265d0 slot_idx=x2
> >> [ 2539.040118] drivers/scsi/mvsas/mv_sas.c 1632:mvs_query_task:rc= 5
> >> [ 2539.040154] drivers/scsi/mvsas/mv_sas.c 2083:port 7 ctrl sts=0x89800.
> >> [ 2539.040163] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x1001001
> >> [ 2539.040176] drivers/scsi/mvsas/mv_sas.c 2111:phy7 Unplug Notice
> >> [ 2539.050220] drivers/scsi/mvsas/mv_sas.c
>
> The controller is reporting a phy ready state change, which is why you see
> the unplug notice.
>
> Can you enable SCSI_SAS_LIBSAS_DEBUG and see if libsas reports anything
> before the abort?
>
> You should be able to turn on in your kernel config:
>
> Device Drivers
> SCSI device support
> SCSI Transports
> Compile the SAS Domain Transport Attributes in debug mode

Hi, I've done as you requested.

Here's all of the output from the first (and currently only) event:

[ 1428.000080] sas: command 0xffff880184ed1680, task 0xffff88017a0f2680, timed out: BLK_EH_NOT_HANDLED
[ 1428.080051] sas: command 0xffff880224e03880, task 0xffff88017a0f24c0, timed out: BLK_EH_NOT_HANDLED
[ 1428.080077] sas: Enter sas_scsi_recover_host
[ 1428.080085] sas: trying to find task 0xffff88017a0f2680
[ 1428.080092] sas: sas_scsi_find_task: aborting task 0xffff88017a0f2680
[ 1428.080102] drivers/scsi/mvsas/mv_sas.c 1703:<7>mv_abort_task() mvi=ffff880224040000 task=ffff88017a0f2680 slot=ffff880224066680 slot_idx=x4
[ 1428.080113] sas: sas_scsi_find_task: querying task 0xffff88017a0f2680
[ 1428.080119] drivers/scsi/mvsas/mv_sas.c 1632:mvs_query_task:rc= 5
[ 1428.080125] sas: sas_scsi_find_task: task 0xffff88017a0f2680 failed to abort
[ 1428.080130] sas: task 0xffff88017a0f2680 is not at LU: I_T recover
[ 1428.080135] sas: I_T nexus reset for dev 0000000000000000
[ 1428.080172] drivers/scsi/mvsas/mv_sas.c 2083:port 0 ctrl sts=0x89800.
[ 1428.080180] drivers/scsi/mvsas/mv_sas.c 2085:Port 0 irq sts = 0x1001
[ 1428.080193] drivers/scsi/mvsas/mv_sas.c 2111:phy0 Unplug Notice
[ 1428.090228] drivers/scsi/mvsas/mv_sas.c 2083:port 0 ctrl sts=0x199800.
[ 1428.090236] drivers/scsi/mvsas/mv_sas.c 2085:Port 0 irq sts = 0x1081
[ 1428.111954] drivers/scsi/mvsas/mv_sas.c 2083:port 0 ctrl sts=0x199800.
[ 1428.111962] drivers/scsi/mvsas/mv_sas.c 2085:Port 0 irq sts = 0x10000
[ 1428.111969] drivers/scsi/mvsas/mv_sas.c 2138:notify plug in on phy[0]
[ 1428.146351] drivers/scsi/mvsas/mv_sas.c 1224:port 0 attach dev info is 20004
[ 1428.146351] drivers/scsi/mvsas/mv_sas.c 1226:port 0 attach sas addr is 0
[ 1428.222044] drivers/scsi/mvsas/mv_sas.c 378:phy 0 byte dmaded.
[ 1428.222109] sas: sas_form_port: phy0 belongs to port0 already(1)!
[ 1430.300028] drivers/scsi/mvsas/mv_sas.c 1586:mvs_I_T_nexus_reset for device[0]:rc= 0
[ 1430.300040] sas: I_T 0000000000000000 recovered
[ 1430.300048] sas: sas_ata_task_done: SAS error 8d
[ 1430.300059] ata9: translated ATA stat/err 0x01/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[ 1430.300883] ata9.00: device reported invalid CHS sector 0
[ 1430.300888] ata9: status=0x01 { Error }
[ 1430.300894] ata9: error=0x04 { DriveStatusError }
[ 1430.300950] sas: trying to find task 0xffff88017a0f24c0
[ 1430.300956] sas: sas_scsi_find_task: aborting task 0xffff88017a0f24c0
[ 1430.300963] sas: sas_scsi_find_task: task 0xffff88017a0f24c0 is done
[ 1430.300968] sas: sas_eh_handle_sas_errors: task 0xffff88017a0f24c0 is done
[ 1430.300974] sas: sas_ata_task_done: SAS error 8d
[ 1430.300982] ata12: translated ATA stat/err 0x01/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[ 1430.301777] ata12.00: device reported invalid CHS sector 0
[ 1430.301782] ata12: status=0x01 { Error }
[ 1430.301788] ata12: error=0x04 { DriveStatusError }
[ 1430.301808] sas: --- Exit sas_scsi_recover_host
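
One rough way to put a number on the pause is to measure how long libsas error handling ran, from the "Enter sas_scsi_recover_host" line to the matching "Exit" line. The sketch below does that for saved log text; it is an illustration only (the function name is made up), and the time spent waiting for the commands to time out before recovery starts is not included.

```python
import re

# Splits a dmesg line into its boot-relative timestamp and the message text.
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def recovery_windows(log_text):
    """Return (enter, exit) timestamps for each sas_scsi_recover_host pass."""
    enter = None
    windows = []
    for line in log_text.splitlines():
        m = TS_RE.match(line)
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        if "Enter sas_scsi_recover_host" in msg:
            enter = ts
        elif "Exit sas_scsi_recover_host" in msg and enter is not None:
            windows.append((enter, ts))
            enter = None
    return windows


sample = """\
[ 1428.080077] sas: Enter sas_scsi_recover_host
[ 1430.300028] drivers/scsi/mvsas/mv_sas.c 1586:mvs_I_T_nexus_reset for device[0]:rc= 0
[ 1430.301808] sas: --- Exit sas_scsi_recover_host
"""
for start, end in recovery_windows(sample):
    print(f"error handling ran for {end - start:.2f} s")
```

For the event above that comes to roughly 2.2 seconds, nearly all of it between the plug-in notice and the I_T nexus reset completing.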

Thanks.

> Thanks,
> David
>
> >> 2083:port 7 ctrl sts=0x199800.
> >> [ 2539.050229] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x1001081
> >> [ 2539.071157] drivers/scsi/mvsas/mv_sas.c 2083:port 7 ctrl sts=0x199800.
> >> [ 2539.071165] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x10000
> >> [ 2539.071173] drivers/scsi/mvsas/mv_sas.c 2138:notify plug in on phy[7]
> >> [ 2539.081142] drivers/scsi/mvsas/mv_sas.c 1224:port 7 attach dev info is 5000002
> >> [ 2539.081142] drivers/scsi/mvsas/mv_sas.c 1226:port 7 attach sas addr is 7
> >> [ 2539.081142] drivers/scsi/mvsas/mv_sas.c 378:phy 7 byte dmaded.
> >> [ 2541.270047] drivers/scsi/mvsas/mv_sas.c 1586:mvs_I_T_nexus_reset for device[5]:rc= 0
> >> [ 2541.270066] ata14: translated ATA stat/err 0x01/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> >> [ 2541.270926] ata14: status=0x01 { Error }
> >> [ 2541.271747] ata14: error=0x04 { DriveStatusError }
> >>
> >> That appeared after about 42 minutes of uptime.
> >
> > So after about 32 hours of uptime theres been 36 separate events. Each
> > spits out similar messages as above, and each comes with a noticeable
> > pause while the drive is reset.
> >
> > There are a number of possible reasons that I'm still having issues:
> > - I managed to mess up the git checkout
> > - My problem isn't related to the fix
> > - The fix doesn't cover all cases of the problem it meant to fix
> >
> > I'm not certain which of them it is, I'd be more inclined to think I
> > messed up the checkout, as I did patch something in, but the patches
> > were completely unrelated and shouldn't have affected the scsi or ata
> > systems at all. At this point I'm just grasping at straws.
> >
> > In case my card is somehow different than expected, I'll paste the lspci
> > info for it: (AOC-SASLP-MV8)
> >
> > 04:00.0 SCSI storage controller: Marvell Technology Group Ltd. MV64460/64461/64462 System Controller, Revision B (rev 01)
> > Subsystem: Super Micro Computer Inc Device 0500
> > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > Latency: 0, Cache Line Size: 64 bytes
> > Interrupt: pin A routed to IRQ 19
> > Region 2: I/O ports at df00 [size=128]
> > Region 4: Memory at fdef0000 (64-bit, non-prefetchable) [size=64K]
> > [virtual] Expansion ROM at fdd00000 [disabled] [size=256K]
> > Capabilities: [48] Power Management version 2
> > Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
> > Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
> > Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> > Address: 0000000000000000 Data: 0000
> > Capabilities: [e0] Express (v1) Legacy Endpoint, MSI 00
> > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
> > ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
> > DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > MaxPayload 128 bytes, MaxReadReq 2048 bytes
> > DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
> > LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 <256ns, L1 unlimited
> > ClockPM- Surprise- LLActRep- BwNot-
> > LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
> > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > Capabilities: [100 v1] Advanced Error Reporting
> > UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> > CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> > CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> > AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
> > Kernel driver in use: mvsas
> >
> > Its installed in a Phenom II X4 810 based system with a 790FX/SB750
> > chipset, 8G DDR3 1333 RAM, 6 1TB Seagate 7200.12 SATAII drives connected
> > to the card via sas->sata breakout cables, and a couple 4 drive SATA
> > hotswap bays. There are also two Seagate 7200.12 500G drives hooked up
> > to the motherboard SATA controller. The system is powered via an Antec
> > Neopower Blue 650W PSU which is probably only half loaded. System also
> > has a discreet gfx card, but its a low end, low profile, fanless card
> > that takes up next to no power.
> >
> > I'm still willing to help test any fixes for the mvsas driver on this
> > card.
> >
> > Thank you.


--
Thomas Fjellstrom
[email protected]